1. Motivation

The project is based on Amazon food data consisting of two datasets: product metadata and product reviews. Together they contain Amazon product reviews as well as metadata about each product, such as product name, product category, and related products. The two datasets complement each other well, since one contains information about the reviewers of products, and the other contains information about the products being reviewed.

We thought it would be interesting to analyse the sentiment of Amazon food reviews, since reviews are often highly subjective yet descriptive. Furthermore, we would like to model the activity of Amazon users (those who write reviews) and use network science tools to identify the so-called "super users" that are highly active and write many reviews.

Finally, we model the related-product recommendations as a network to investigate whether a product of one category tends to recommend products of a similar category.

All in all, we want the reader to gain a new perspective on user reviews and product recommendations.

2. Basic stats

Loading data

Below, we load the review dataset into a pandas dataframe. The dataset consists of product reviews and metadata of products in the "Grocery and Gourmet Food" category on Amazon and was collected by researchers at UCSD.
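The loading step can be sketched as follows. The file names in the comments are assumptions based on the UCSD naming scheme; for illustration, a tiny inline sample in the same JSON-lines format is parsed instead:

```python
import io
import pandas as pd

# Both files are JSON-lines: one JSON object per row. For the real data:
#   review_df = pd.read_json("reviews_Grocery_and_Gourmet_Food_5.json", lines=True)
#   meta_df = pd.read_json("meta_Grocery_and_Gourmet_Food.json", lines=True)
# Here we parse a two-row inline sample with the same review schema.
sample = io.StringIO(
    '{"reviewerID": "A1", "asin": "B0001", "overall": 5.0, "reviewText": "Great tea"}\n'
    '{"reviewerID": "A2", "asin": "B0001", "overall": 3.0, "reviewText": "It was okay"}\n'
)
review_df = pd.read_json(sample, lines=True)
print(review_df.shape)  # 2 rows, 4 columns
```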

The original reviews dataset has 142.8 million reviews spanning May 1996 to July 2014. However, we only use a subset of the dataset in which all users and items have at least 5 reviews in the "Grocery and Gourmet Food" category, called food 5-core. The food 5-core dataset is 108 MB and contains 151,254 reviews, with each review having 9 attributes:

Below, we load the metadata dataset into a pandas dataframe. The metadata dataset contains metadata for 171,761 products in the "Grocery and Gourmet Food" category and is 182 MB. Each product has 9 attributes:

Review network

In the review network, the nodes are reviewers and there is an edge between nodes if they have reviewed the same product. It is an undirected, weighted graph. In order to build the network, all unique products in the review dataframe are found. For each unique product, all users who have reviewed that product are found, and edges are established between them.

Due to the sheer size of the network, it is infeasible to visualise the whole network. Therefore, we make a subset consisting of 500 randomly sampled products and construct a network using the reviewers that have reviewed the subset products. The code below does that.

Create edges for the undirected network and remove duplicates.

Create the reviewer network.
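The two steps above can be sketched as follows. Column names follow the review schema ("asin" is the product id, "reviewerID" the user); duplicate pairs are handled by incrementing the edge weight, so the weight counts co-reviewed products:

```python
import itertools

import networkx as nx
import pandas as pd

def build_reviewer_network(review_df):
    """Undirected, weighted co-review network: reviewers are nodes, and an
    edge's weight counts how many products two reviewers both reviewed."""
    G = nx.Graph()
    for _, reviewers in review_df.groupby("asin")["reviewerID"]:
        # Connect every pair of users who reviewed this product.
        for u, v in itertools.combinations(sorted(set(reviewers)), 2):
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

# Toy example: product P1 reviewed by A, B, C and product P2 by A, B.
df = pd.DataFrame({
    "asin":       ["P1", "P1", "P1", "P2", "P2"],
    "reviewerID": ["A",  "B",  "C",  "A",  "B"],
})
G = build_reviewer_network(df)
print(G.number_of_edges(), G["A"]["B"]["weight"])  # 3 edges; A-B has weight 2
```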

We can now visualise the network using netwulf.

Due to the size of the network, it is only visualised with a subset of reviewers that have reviewed the 500 randomly sampled products. The graph of the subset contains $5773$ nodes and $229767$ edges.

(Figure: the reviewer-network subset visualised with netwulf)

Culprits (with highest node degree):

Product network

In order to tie our text and network analysis better together, we remove all products in the product dataframe that are not also in the review dataframe.

We are only interested in products that have at least one subcategory other than "Grocery & Gourmet Food", since we want to look at how well these subcategories work as communities. Therefore, we remove all products that do not have any subcategories.

We prepare the edges for the product network, after which we can create the initial network:

We create a networkx graph which has 358 nodes and 2814 edges.

Because adding the edges re-introduced some of the nodes with only the "Grocery & Gourmet Food" category that we removed before, we remove them again. Furthermore, each node is given an attribute corresponding to its category.

Some stats for the product network:

We can now visualize the network

3. Tools, theory and analysis

Review network

In some rows, the numbers do not match up. For example, in row 1 one value is 204 while another is 203. It turns out that the reviewer name is sometimes recorded as 'nan' (not a number).

It is worth noting that among the identified users with the highest node degree, only one of them (Nerd Alert) is among the users with the top 10 most reviews. However, keep in mind that the network has an edge between two nodes if they have reviewed the same product. This means if a product has been reviewed by many people, then the users who have reviewed that product will have many edges.

The user with the highest number of reviews and node degree is C. Hill "CFH". However, writing many reviews does not equate to having a high node degree. For instance, the user "Gary Peterson" has written the second most reviews with 180, yet only has a node degree of 2,374. Comparatively, another user, "bsg2004", has only written 75 reviews but has a much higher node degree of 3,701. Based on the average review score, we see that even though the users "Gary Peterson" and "NYFB" have written many reviews, the products they review are not commonly reviewed by others.

In conclusion, the reviewer network can be useful to identify some super users. However, the number of times a product has been reviewed needs to be accounted for, since it has a major impact on the node degree.

Random network

The reviewer network is an example of a real-world network constructed from real data. We would like to construct a random network that shares some of its properties with the reviewer network, and then compare the two.

Using equation 3.2 from [1], we want to find the probability $p$ such that a network with $N$ nodes has $\text{<}L\text{>}$ expected edges:

$$ \begin{equation} \text{<}L\text{>} = p \cdot \frac{N(N-1)}{2} \end{equation} $$

Since there are $L=229767 = \text{<}L\text{>}$ edges in the network, the probability can be computed as:

$$ \begin{equation} p = \frac{2\cdot L}{N(N-1)} \end{equation} $$

With the found value of $p$, the average degree of the network $\text{<}k\text{>}$ is computed using equation 3.3 in [1]:

$$ \begin{equation} \text{<}k\text{>} = p(N-1) \end{equation} $$

Now, we build a random network with the same number of nodes as the reviewer network. With the probability $p = 2 \cdot 229767 / (5773 \cdot 5772) \approx 0.0138$, an edge is added between each pair of nodes $i$ and $j$.
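The computation and the $G(N, p)$ construction can be sketched with networkx; `fast_gnp_random_graph` samples the same model as `gnp_random_graph` but in $O(N + L)$ time, which matters at this size:

```python
import networkx as nx

N, L = 5773, 229767            # nodes and edges of the reviewer-network subset
p = 2 * L / (N * (N - 1))      # p = 2L / (N(N-1))
k_avg = p * (N - 1)            # <k> = p(N-1), which equals 2L/N here

# Erdos-Renyi G(N, p) random network with the same number of nodes.
random_net = nx.fast_gnp_random_graph(N, p, seed=42)
print(round(p, 4), round(k_avg, 1))  # 0.0138 79.6
```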

Degree distribution

The degree distribution histograms are visualised with linear x- and y-axes. We tried logarithmic axes on both plots, but that made the degree distribution of the reviewer network look Gaussian, which is misleading.

Degree distribution for reviewer network

Degree distribution for random network

Both networks have around the same average degree, but their distributions differ. The degree distribution of the reviewer network is much more spread out, with degrees ranging from 8 to 5155, and is right-skewed: most of the mass sits at small degrees, so smaller degrees are the most frequent. This is reflected in the median degree of the reviewer network, which is much lower than the median degree of the random network. The degree distribution of the random network, on the other hand, has an almost identical mean and median and closely resembles a Gaussian distribution. According to [2], the degree distribution of a random network should follow a binomial distribution (which approaches a Gaussian distribution as the number of samples goes towards infinity).

For both networks we can compute the average clustering coefficient, which is a measure of the degree to which nodes tend to cluster together. Using equation 2.15 in [3], the local clustering coefficient for a node $i$ is computed as: $$ \begin{equation} C_i = \frac{2\cdot L_i}{k_i(k_i-1)} \end{equation} $$ where $L_i$ is the number of edges node $i$ has to its $k_i$ neighbours.

The average clustering coefficient is found by averaging the local clustering coefficient for all nodes. This is computed for both the reviewer network and the random network.
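networkx's `average_clustering` implements exactly this average of the local coefficients; a minimal check on a toy graph:

```python
import networkx as nx

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Local coefficients: C_0 = C_1 = 1, C_2 = 1/3 (one edge among its three
# neighbours), C_3 = 0 (degree < 2 counts as zero by convention), so the
# average is (1 + 1 + 1/3 + 0) / 4 = 7/12.
avg_C = nx.average_clustering(G)
print(round(avg_C, 4))  # 0.5833
```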

We see that the average clustering coefficient is much larger for the reviewer network than the random network with a similar amount of nodes and edges. This is consistent with what is described in [4]: one would expect a real-life network with N nodes and L edges to have much higher clustering coefficients than a random network of similar size.

Product Network

Here we take a closer look at a few select products, one of each category. Especially the products B009GCXEW4 and B0033HGLTG are interesting since B009GCXEW4 is a lone off-category node on an island mostly consisting of products with the "Beverages" category like B0033HGLTG.

We now want to look at communities. The communities made from the product categories are a focus of our analysis since we can get insight into how much Amazon tends to recommend products of the same category.

We want to calculate the modularities of the communities defined by the categories. The modularity gives us a measure of how good the partition made from the category tags is, and allows us to compare it directly to other partitions, in particular a partition made with the Louvain algorithm. We define a function to calculate the modularity from the equation in the Network Science book: http://networksciencebook.com/chapter/9#modularity

"The modularity is a measure for the quality of the division of a network. A high modularity score means that the given division has divided the network into communities that are strongly interconnected and which does not have many connections outside of the community. The higher the modularity, the better the division is." [5]

We make a Louvain split in order to compare how good the category split is relative to it.

The Louvain algorithm maximises modularity and aggregates communities in a graph. [6]
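With networkx (2.8 or newer) the split can be sketched as below; the python-louvain package's `community_louvain.best_partition` is an equivalent alternative:

```python
import networkx as nx

# Two triangles joined by one bridge edge; Louvain should recover them.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
communities = nx.community.louvain_communities(G, seed=42)
# Each element of `communities` is a set of nodes in one community.
print(len(communities), nx.community.modularity(G, communities) > 0.3)
```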

Since it maximises modularity, we can directly compare its communities with those of the category partition. This allows us to judge whether related products are typically of the same category as the product they are related to: if the modularity of the category partition is close to that of the Louvain partition, the category communities should be close to maximally interconnected.

We can also visualize the Louvain split network:

To get insight into the overlap between the Louvain and the category partitioning, we make a confusion matrix.

This confusion matrix shows which community the products belong to in both the category and Louvain partitions. This makes it easier to see whether the Louvain algorithm mimics certain categories, and which categories are the most interconnected. For example, if a category is contained almost entirely within a Louvain community, it means that the category community was strongly interconnected.

It is hard to see what is going on in the confusion plot since there are so many bins in the Louvain split. Therefore, we remove the singleton nodes and repeat the analysis to see if we find something new. Singleton nodes do not affect the modularity, as they are not connected to any other nodes, so we can safely remove them without changing the measures.

We then repeat the steps from before with the new singleton-free dataset.

The confusion matrix clearly shows that some communities in the category partition are strongly interconnected. For example, the "Baby Food" category is entirely contained within a Louvain community. Other, larger categories like "Cooking & Baking" and "Beverages" are split up, which makes sense when you consider that they were bunched together in the plot.

Text analysis

TF-IDF

The code below tokenises every review text, where the text contains both the review and the summary.
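A minimal stand-in for the tokeniser (the notebook presumably uses nltk; this regex version avoids the extra data download and keeps only lowercase word tokens):

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens (with apostrophes)."""
    return re.findall(r"[a-z']+", str(text).lower())

print(tokenize("GREAT tea, really good!"))  # ['great', 'tea', 'really', 'good']
```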

To save time, this review_df is saved, and then loaded in the code snippet below:

A dictionary for each of the 358 products is created, containing all its tokens:

Now we want a dictionary containing the tokens for each subcategory:

We see that the categories containing only a single product each (Flowers & Plants and Gourmet Gifts) have a very small number of tokens.

Now, we compute the TF and TF-IDF dictionaries from the token dictionaries:
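A minimal TF-IDF sketch over such token dictionaries, using raw term counts for TF and a logarithmic IDF over the categories (the notebook's exact weighting may differ):

```python
import math
from collections import Counter

def tf_idf(token_dicts):
    """token_dicts: {category: [tokens]} -> {category: {term: tf-idf score}}."""
    N = len(token_dicts)
    df = Counter()                      # number of categories each term appears in
    for tokens in token_dicts.values():
        df.update(set(tokens))
    return {
        cat: {t: n * math.log(N / df[t]) for t, n in Counter(tokens).items()}
        for cat, tokens in token_dicts.items()
    }

docs = {"tea": ["tea", "tea", "good"], "coffee": ["coffee", "good"]}
scores = tf_idf(docs)
# "good" occurs in every category, so its IDF (and hence TF-IDF) is zero.
print(round(scores["tea"]["tea"], 3), scores["tea"]["good"])  # 1.386 0.0
```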

Below, the TF scores are shown alongside the TF-IDF scores:

We get rid of many of the "boring" words, such as "not", "like", and "good", when using TF-IDF scores instead of pure TF.


Testing some TF-IDF scores for baby foods:

Wordclouds

The TF-IDF scores from before are now used to visualise wordclouds for the 6 biggest food categories.

First we create a mask_dictionary containing the 6 masks for the wordclouds:

Clipart images from:

https://icon-library.com/icon/at-icon-png-28.html burger
https://svgsilh.com/image/146690.html candy
https://www.istockphoto.com/search/2/image?mediatype=illustration&phrase=pacifier pacifier
https://cdn.create.vista.com/api/media/small/193917378/stock-vector-frying-pan-icon-in-flat pan
https://t4.ftcdn.net/jpg/02/74/47/29/240_F_274472965_p7RnT0D13PXObI2x033DwTL0l19FJv9y.jpg can
https://prosteps.cloudimg.io/v7m/resizeinbox/1000x1000/fsharp0/https://tilroy.s3.eu-west-1.amazonaws.com/472/product/14873454940331341306439275.png wine

Below we check how many times the following words "stevia", "xylitol", "keurig" are present in each food category:

Sentiment analysis

Now for the sentiment analysis. We start by loading the labMT dataframe, and then create our own dictionary of happiness scores:

We use the function from week 7 that computes a happiness score given a list of tokens.
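The scoring function can be sketched as below, assuming `happiness` maps word to labMT score; words missing from the dictionary are skipped:

```python
def happiness_score(tokens, happiness):
    """Average labMT happiness over the tokens present in the dictionary."""
    scores = [happiness[t] for t in tokens if t in happiness]
    return sum(scores) / len(scores) if scores else None

# Tiny made-up dictionary for illustration (real labMT scores differ).
happiness = {"love": 8.42, "hate": 2.20}
print(round(happiness_score(["i", "love", "tea", "hate", "mornings"], happiness), 2))  # 5.31
```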

We create a review_df2, which is indexed by product ID ("asin"):

Creating a tokens column for each product:

We calculate happiness scores for the first 5000 reviews:

Some examples of reviews with low/high happiness scores:

Showing the 5 most negative reviews:

... and the 5 most positive reviews:

Now we create a happiness score for each of the 358 products:

We create a mean rating for each product and add it as a column in the dataframe:

Plotting the figure of sentiment vs. mean rating:

Testing if the linear correlation is significant with $N=358$ products:
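The test can be sketched with synthetic data (assuming scipy; in the notebook, the inputs would be the 358 per-product mean ratings and happiness scores):

```python
from scipy import stats

# Synthetic stand-ins for the per-product values.
mean_rating = [3.0, 3.5, 4.0, 4.5, 5.0]
happiness = [5.1, 5.3, 5.2, 5.6, 5.8]

# pearsonr returns (correlation coefficient, two-sided p-value).
r, p_value = stats.pearsonr(mean_rating, happiness)
print(round(r, 3), p_value < 0.05)  # 0.922 True
```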

The function stats.pearsonr outputs (correlation coefficient, p-value). The p-value is 0.0002, which clearly indicates a significant linear relationship between the mean rating and the happiness score of the products.

However, the correlation coefficient is only 0.196, indicating a rather weak relationship between the two (likely because the tabular labMT method of computing happiness scores is too simple).


Calculating mean rating and mean sentiment for each food category (not used in the website):

Linear correlation plots for each food category:

4. Discussion

Network analysis

The product network gave satisfactory results, as we were able to confirm our hypothesis that the product category is a good partition with some distinct communities.

The random network, which was constructed to have similar properties to the reviewer network, behaved like a truly random network. The degree distribution of the random network looks like a binomial distribution, and its average clustering coefficient was much lower than the reviewer network.

The reviewer network did not highlight the super users as well as we had hoped: a user with many reviews does not necessarily have a large node degree, since an edge is only established between two nodes (representing two different users) if they have reviewed the same product. Therefore, a user with many reviews could have a low node degree if few other users have reviewed the same products as them.

Text analysis

The correlation between the sentiment (happiness scores) and the mean ratings for the products could have been stronger, but it was significant, which was the main idea. It would be interesting to use a more advanced sentiment analysis method, e.g. with deep learning. With a more advanced model, we believe the correlation would be stronger.

As for the wordclouds, we think they characterised the food categories quite well, and the food-shaped masks add a fun way to visualise them.

5. References

[1] http://networksciencebook.com/chapter/3#number-of-links

[2] http://networksciencebook.com/chapter/3#degree-distribution

[3] http://networksciencebook.com/chapter/2#clustering

[4] http://networksciencebook.com/chapter/3#clustering-3-9

[5] Our Assignment 2

[6] https://towardsdatascience.com/louvain-algorithm-93fde589f58c

6. Work distribution